In this lesson, we cover clustering theory and distance and similarity measures. We use R to apply and evaluate various clustering techniques, such as hierarchical clustering, k-means clustering, and PAM. We also look at techniques that help choose the number of clusters, as well as how to evaluate cluster performance.
To run the code you may need the additional packages ggplot2, cluster, amap, and useful.
install.packages("ggplot2");
install.packages("cluster");
install.packages("amap");
install.packages("useful");
require(ggplot2)
## Loading required package: ggplot2
require(cluster)
## Loading required package: cluster
require(amap)
## Loading required package: amap
require(useful)
## Loading required package: useful
We’ll be using GDP per capita, life expectancy, infant mortality, and literacy data made available by the World Bank (data.worldbank.org).
GDP per capita (current US$)
GDP per capita is gross domestic product divided by midyear population. GDP is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in current U.S. dollars.
Life expectancy at birth, total (years)
Life expectancy at birth indicates the number of years a newborn infant would live if prevailing patterns of mortality at the time of its birth were to stay the same throughout its life. Derived from male and female life expectancy at birth from sources such as: (1) United Nations Population Division. World Population Prospects, (2) United Nations Statistical Division. Population and Vital Statistics Report (various years), (3) Census reports and other statistical publications from national statistical offices, (4) Eurostat: Demographic Statistics, (5) Secretariat of the Pacific Community: Statistics and Demography Programme, and (6) U.S. Census Bureau: International Database.
Mortality rate, infant (per 1,000 live births)
Infant mortality rate is the number of infants dying before reaching one year of age, per 1,000 live births in a given year. Estimates developed by the UN Inter-agency Group for Child Mortality Estimation (UNICEF, WHO, World Bank, UN DESA Population Division) at www.childmortality.org.
Literacy rate, adult total (% of people ages 15 and above)
Adult (15+) literacy rate (%). Total is the percentage of the population age 15 and above who can, with understanding, read and write a short, simple statement on their everyday life. Generally, ‘literacy’ also encompasses ‘numeracy’, the ability to make simple arithmetic calculations. This indicator is calculated by dividing the number of literates aged 15 years and over by the corresponding age group population and multiplying the result by 100.
We will also be using Francis Galton’s analysis of the heights of sons and fathers. Heights of sons of both tall and short fathers appeared to “revert” or “regress” to the mean of the group.
We will also be using the Wholesale customers Data Set (Data used in Hierarchical Clustering). The data set refers to clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories.
Link: https://archive.ics.uci.edu/ml/machine-learning-databases/00292/ Source: UCI Machine Learning Repository : Data Set: Wholesale customer data.csv
The data has been modified so that each row (expenses) has been simply identified as a different Region (Region 1 to 49).
# Load our data
data_url <- 'http://54.198.163.24/YouTube/MachineLearning/M04/Galton_heights_sons_and_fathers.csv'
galton <- read.csv(url(data_url))
data_url <- 'http://54.198.163.24/YouTube/MachineLearning/M04/AnnualSpending.csv'
spend <- read.csv(url(data_url))
data_url <- 'http://54.198.163.24/YouTube/MachineLearning/M04/data.worldbank.org.csv'
dwb <- read.csv(url(data_url),header=TRUE,na.strings=c("NA","..", "?"))
Francis Galton’s analysis of the heights of sons and fathers.
head(galton)
## Family Father Mother Gender Height Kids
## 1 1 78.5 67.0 M 73.2 4
## 2 1 78.5 67.0 F 69.2 4
## 3 1 78.5 67.0 F 69.0 4
## 4 1 78.5 67.0 F 69.0 4
## 5 2 75.5 66.5 M 73.5 4
## 6 2 75.5 66.5 M 72.5 4
h.gend<-galton[c("Height","Gender")]
head(h.gend)
## Height Gender
## 1 73.2 M
## 2 69.2 F
## 3 69.0 F
## 4 69.0 F
## 5 73.5 M
## 6 72.5 M
data.worldbank.org
head(dwb)
## Country Country.Code Region
## 1 Afghanistan AFG South Asia
## 2 Albania ALB Europe & Central Asia
## 3 Algeria DZA Middle East & North Africa
## 4 American Samoa ASM East Asia & Pacific
## 5 Andorra ADO Europe & Central Asia
## 6 Angola AGO Sub-Saharan Africa
## Income.Group Per.capita.income Literacy Life.expectancy
## 1 Low income 590.2695 NA 60.37446
## 2 Upper middle income 3965.0168 NA 77.83046
## 3 Upper middle income 4206.0312 NA 74.80810
## 4 Upper middle income NA NA NA
## 5 High income NA NA NA
## 6 Upper middle income 4102.1186 70.77841 52.26688
## Infant.mortality
## 1 66.3
## 2 12.5
## 3 21.9
## 4 NA
## 5 2.1
## 6 96.0
nrow(dwb) - nrow(na.omit(dwb)) # There are some rows with at least one NA
## [1] 172
wb<-dwb[c("Country","Life.expectancy","Infant.mortality","Per.capita.income","Literacy")]
head(wb)
## Country Life.expectancy Infant.mortality Per.capita.income
## 1 Afghanistan 60.37446 66.3 590.2695
## 2 Albania 77.83046 12.5 3965.0168
## 3 Algeria 74.80810 21.9 4206.0312
## 4 American Samoa NA NA NA
## 5 Andorra NA 2.1 NA
## 6 Angola 52.26688 96.0 4102.1186
## Literacy
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 70.77841
wb<-na.omit(wb) # Clustering won't work with NA or Inf values
# We remove the NA's after subsetting so that an NA in a column of no interest doesn't remove a row that has otherwise good data
wb.Country<-wb$Country # We need the country names for formatting and confusion matrices
wb[["Country"]] <- NULL
head(wb)
## Life.expectancy Infant.mortality Per.capita.income Literacy
## 6 52.26688 96.0 4102.119 70.77841
## 7 75.93763 5.8 14128.879 98.95000
## 13 70.76322 27.9 5496.345 99.78936
## 16 71.62590 30.7 1211.702 59.72154
## 25 76.43324 5.1 4197.807 98.26413
## 26 64.42924 34.8 6360.645 87.32057
nrow(wb) # Make sure there are enough records after removing NA
## [1] 45
Wholesale customers Data Set
head(spend)
## Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 1 12669 9656 7561 214 2674 1338
## 2 2 7057 9810 9568 1762 3293 1776
## 3 3 6353 8808 7684 2405 3516 7844
## 4 4 13265 1196 4221 6404 507 1788
## 5 5 22615 5410 7198 3915 1777 5185
## 6 6 9413 8259 5126 666 1795 1451
What is Clustering? Clustering is grouping like with like, such that objects in the same group are more similar to each other than to objects in other groups.
There are two primary approaches to measuring the “closeness” of data: distance and similarity. Distance measures are based on the notion of a metric space.
A similarity measure is a real-valued function that quantifies the similarity between two objects. While no single definition of a similarity measure exists, such measures don’t necessarily satisfy the axioms of a metric space. For example, cosine similarity is a measure of the “closeness” of two vectors that is not a metric.
A metric space is an ordered pair \((M,d)\) where \(M\) is a set and \(d\) is a distance (i.e.metric) on \(M\), i.e., a function:
\(d \colon M \times M \rightarrow \mathbb{R}\)
such that for any \(x, y, z \in M\), the following holds: non-negativity, \(d(x,y) \ge 0\); identity of indiscernibles, \(d(x,y) = 0 \iff x = y\); symmetry, \(d(x,y) = d(y,x)\); and the triangle inequality, \(d(x,z) \le d(x,y) + d(y,z)\).
Examples of metrics include the Chebyshev distance, Euclidean distance, Hamming distance, Minkowski distance, and many others.
Euclidean distance is the most common metric for measuring the distance between two vectors. It is the standard distance of Cartesian coordinates. That is, if \(p = (p_1, p_2,..., p_n)\) and \(q = (q_1, q_2,..., q_n)\) are two points in Euclidean n-space, then the distance (d) from p to q, or from q to p, is given by the Pythagorean formula:
\[\begin{align}\mathrm{d}(\mathbf{p},\mathbf{q}) = \mathrm{d}(\mathbf{q},\mathbf{p}) & = \sqrt{(q_1-p_1)^2 + (q_2-p_2)^2 + \cdots + (q_n-p_n)^2} \\[8pt] & = \sqrt{\sum_{i=1}^n (q_i-p_i)^2}.\end{align}\]
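As a quick check of the formula, the distance can be computed by hand or with base R’s dist(), which defaults to the Euclidean metric. A sketch with made-up points:

```r
# Euclidean distance between two points, by the Pythagorean formula and via dist()
p <- c(1, 2, 3)
q <- c(4, 6, 3)
sqrt(sum((q - p)^2))  # 5
dist(rbind(p, q))     # dist() defaults to the Euclidean metric; also 5
```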
Similarity (or dissimilarity) measures quantify closeness without the constraints, and benefits, of being a formal metric.
The most common form of the similarity measure is the vector inner product (or cosine similarity). Given vectors a and b, the vector inner product can be defined using the Euclidean dot product formula:
\[\mathbf{a}\cdot\mathbf{b} =\left\|\mathbf{a}\right\|\left\|\mathbf{b}\right\|\cos\theta\]
This similarity measure can also be ranged normalized. Alternately, we can normalize this measure by dividing each vector component by the magnitude of the vector.
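The dot-product formula above can be rearranged to compute cosine similarity directly; a minimal base-R sketch with made-up vectors:

```r
# Cosine similarity: dot product divided by the product of the magnitudes
a <- c(1, 2, 3)
b <- c(2, 4, 6)
sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))  # 1, since b is a scalar multiple of a
```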
Correlation based similarity is usually the Pearson correlation. Pearson product-moment correlation coefficient commonly represented by the Greek letter \(\rho\) (rho) and is defined as:
\[ \rho_{X,Y}= \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} \]
where \(\operatorname{cov}\) is the covariance and \(\sigma_X\) is the standard deviation of \(X\).
The formula for \(\rho\) can be expressed in terms of means and expectations:
\[\operatorname{cov}(X,Y) = E[(X-\mu_X)(Y-\mu_Y)]\]
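These definitions can be checked numerically on the Galton data loaded above; cov() and sd() reproduce what cor() computes:

```r
# Pearson correlation from its definition, compared against cor()
x <- galton$Father
y <- galton$Height
cov(x, y) / (sd(x) * sd(y))  # matches cor(x, y)
cor(x, y)
```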
\[ M= \begin{bmatrix} x_{1,1} & x_{1,2} & x_{1,3} & ... & x_{1,n} \\ x_{2,1} & x_{2,2} & x_{2,3} & ... & x_{2,n} \\ x_{3,1} & x_{3,2} & x_{3,3} & ... & x_{3,n} \\ .. & .. & .. & ... & .. \\ x_{n,1} & x_{n,2} & x_{n,3} & ... & x_{n,n} \\ \end{bmatrix} \]
Distance matrix (symmetric with zero diagonal, so only the lower triangle is shown)
\[ M= \begin{bmatrix} 0 & & & & \\ x_{2,1} & 0 & & & \\ x_{3,1} & x_{3,2} & 0 & & \\ .. & .. & .. & ... & .. \\ x_{n,1} & x_{n,2} & x_{n,3} & ... & 0 \\ \end{bmatrix} \]
The dissimilarity/similarity matrix is calculated by iterating over each element and calculating its dissimilarity/similarity to every other element. Let \(A\) be a dissimilarity matrix of size \(N \times N\), and \(B\) a set of \(N\) elements. \(A_{i,j}\) is the dissimilarity/similarity between elements \(B_i\) and \(B_j\).
for i = 1 to N do
    for j = 1 to N do
        Aij = Dissimilarity(Bi,Bj) // or Similarity(Bi,Bj)
    end-for
end-for
where the dissimilarity/similarity matrix is usually defined as follows:
\[ M= \begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ .. & .. & .. & ... & .. \\ d(n,1) & d(n,2) & d(n,3) & ... & 0 \\ \end{bmatrix} \]
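In R the double loop above is unnecessary: dist() computes the lower triangle of the dissimilarity matrix directly. A small sketch with made-up points:

```r
# dist() returns the lower triangle of the pairwise distance matrix
m <- matrix(c(0, 0,
              3, 4,
              6, 8), ncol = 2, byrow = TRUE)
dist(m)  # d(2,1) = 5, d(3,1) = 10, d(3,2) = 5
```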
Clustering (e.g., k-means, mixture models, hierarchical clustering). Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). There are various types of cluster analysis.
Partitioning algorithms: construct various partitions and then evaluate them by some criterion.
Hierarchy algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion.
Density-based: based on connectivity and density functions.
Grid-based: based on a multiple-level granularity structure.
Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of that model to the data.
Partitioning method: construct a partition of n documents into a set of K clusters. Given a set of documents and the number K, find the partition into K clusters that optimizes the chosen partitioning criterion. Finding the globally optimal partition is intractable for many objective functions, since it would require exhaustively enumerating all partitions. Effective heuristic methods exist, however: the K-means and K-medoids algorithms.
The term “k-means” was first used by James MacQueen in 1967, though the idea goes back to Hugo Steinhaus in 1957. Given a desired \(k\) clusters and \(n\) data points \(X = \{x_1, x_2, \ldots, x_n\}\):
Initialize centroids \(\mu_1, \mu_2, \ldots, \mu_k \in \mathbb{R}^n\) (usually randomly)
while (not converged):
Step A (Assignment step):
Find the closest cluster to every point in \(X = {x_1, x_2, …, x_n}\)
That is,
\[\underset{\mathbf{S}} {\operatorname{arg\,min}} \sum_{i=1}^{k} \sum_{\mathbf x \in S_i} \left\| \mathbf x - \boldsymbol\mu_i \right\|^2\]
where \(\mu_i\) is the mean of points in cluster \(S_i\).
\[ S_i^{(t)} = \big \{ x_p : \big \| x_p - m^{(t)}_i \big \|^2 \le \big \| x_p - m^{(t)}_j \big \|^2 \ \forall j, 1 \le j \le k \big\} \]
Step B (Update step):
Calculate the new means to be the centroids of the observations in the new clusters.
\[ m^{(t+1)}_i = \frac{1}{|S^{(t)}_i|} \sum_{x_j \in S^{(t)}_i} x_j \]
The centroid of a finite set of \(k\) points \(\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_k \in \mathbb{R}^n\) is
\[ \mathbf{C} = \frac{\mathbf{x}_1+\mathbf{x}_2+\cdots+\mathbf{x}_k}{k} \]
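The assignment and update steps above can be sketched in a few lines of base R. This is an illustration only (the hypothetical lloyd helper assumes no cluster empties during iteration); the kmeans() function used later in this lesson is the practical choice:

```r
# A minimal sketch of Lloyd's algorithm: alternate assignment and update steps
lloyd <- function(X, k, iters = 100) {
  X  <- as.matrix(X)
  mu <- X[sample(nrow(X), k), , drop = FALSE]    # random initial centroids
  for (t in seq_len(iters)) {
    # Assignment step: squared distance from every point to every centroid
    d  <- sapply(seq_len(k), function(j) colSums((t(X) - mu[j, ])^2))
    cl <- apply(d, 1, which.min)
    # Update step: each centroid becomes the mean of its assigned points
    mu_new <- t(sapply(seq_len(k), function(j)
      colMeans(X[cl == j, , drop = FALSE])))
    if (max(abs(mu_new - mu)) < 1e-8) break      # converged
    mu <- mu_new
  }
  list(cluster = cl, centers = mu)
}
```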
Demonstration of the standard k-means algorithm:
k initial centroids
find nearest cluster
new centroids
k-means algorithm converges
For problems with k-means clustering, see “K-means clustering is not a free lunch.”
The K-medoids or Partitioning Around Medoids (PAM) algorithm (Kaufman & Rousseeuw ’87) is related to the k-means algorithm but uses medoid shifts rather than reassigning points based on Euclidean distance. Each cluster is represented by one of the objects (i.e. points) in the cluster. A medoid is a point in a cluster whose dissimilarity to all the points in the cluster is minimal. Medoids are similar in concept to means or centroids, but medoids are always members of the data set. That is, in 2D Cartesian space a centroid can be any valid x,y coordinate, whereas a medoid must be one of the data points (the data point least dissimilar to the rest).
Pseudocode for the k-medoid clustering (Partitioning Around Medoids (PAM)) algorithm:
Initialize: randomly select (without replacement) k of the n data points as the medoids
Associate each data point to the closest medoid.
While the cost of the configuration decreases:
For each medoid m, for each non-medoid data point o:
Swap m and o, recompute the cost
If the total cost of the configuration increased in the previous step, undo the swap.
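The cluster package loaded at the top of this lesson provides PAM directly; for example, on the World Bank data prepared above:

```r
# PAM (k-medoids) on the World Bank data; medoids are actual data rows
wb.pam <- pam(wb, k = 3)
wb.pam$medoids           # each medoid is one of the observations
table(wb.pam$clustering) # cluster sizes
```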
In hierarchical clustering the idea is to group data objects (i.e. points) into a tree of clusters. That is, hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters.
These trees (hierarchies) generally fall into two types:
Initially, each data object (i.e. point) is in its own cluster. The clusters are then iteratively merged from the “bottom-up”: the two most similar/closest clusters are aggregated into one, then the next two, until there is just one cluster. This agglomerative approach can result in “straggly” (long and thin) clusters due to a chaining effect, and it is also sensitive to noise.
In divisive hierarchical clustering, all data objects (i.e. points) are initially in one cluster. The clusters are then successively divided, recursively, in a “top-down” manner: the cluster is broken into the two clusters that are most dissimilar, then each of those clusters is broken into the two clusters that are most dissimilar, and so on until each cluster is a single data object (i.e. point).
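Both approaches are available in the cluster package loaded above: agnes() builds the hierarchy bottom-up and diana() top-down. A sketch using the World Bank data:

```r
# Agglomerative (agnes) vs. divisive (diana) hierarchical clustering
wb.agg <- agnes(wb, method = "average")  # bottom-up merging
wb.div <- diana(wb)                      # top-down splitting
wb.agg$ac  # agglomerative coefficient (closer to 1 = stronger structure)
wb.div$dc  # divisive coefficient
```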
Density-based clustering is similar to k-means clustering, except that it uses the expectation-maximization (EM) algorithm to generate, for each data object (i.e. point), a likelihood that it belongs to each cluster. k-means corresponds to the special case of a Gaussian mixture model in the limit of taking all covariances as diagonal, equal, and small; conversely, the k-means problem can be generalized into a Gaussian mixture model.
Density-based clustering uses a density estimator to learn a probabilistic mapping from a set of attributes to a probability.
Input -> Density Estimator -> Probability
Given a data object x(i.e. point), a density estimator M can tell you how likely x belongs to cluster k.
Given a statistical model which generates a set \(\mathbf{X}\) of observed data (e.g. assuming it comes from a Gaussian distribution), a set of unobserved latent data or missing values \(\mathbf{Z}\), and a vector of unknown parameters \(\boldsymbol\theta\), along with a likelihood function \(L(\boldsymbol\theta; \mathbf{X}, \mathbf{Z}) = p(\mathbf{X}, \mathbf{Z}|\boldsymbol\theta)\), the maximum likelihood estimate (MLE) of the unknown parameters is determined by the marginal likelihood of the observed data, \(L(\boldsymbol\theta; \mathbf{X}) = p(\mathbf{X}|\boldsymbol\theta) = \sum_{\mathbf{Z}} p(\mathbf{X},\mathbf{Z}|\boldsymbol\theta)\).
The EM algorithm seeks to find the MLE of the marginal likelihood by iteratively applying the following two steps:
Expectation step (E step): Calculate the expected value of the log likelihood function, with respect to the conditional distribution of \(\mathbf{Z}\) given \(\mathbf{X}\) under the current estimate of the parameters \(\boldsymbol\theta^{(t)}\):
\[ Q(\boldsymbol\theta|\boldsymbol\theta^{(t)}) = \operatorname{E}_{\mathbf{Z}|\mathbf{X},\boldsymbol\theta^{(t)}}\left[ \log L(\boldsymbol\theta;\mathbf{X},\mathbf{Z}) \right] \]
Maximization step (M step): Find the parameter that maximizes this quantity:
\[ \boldsymbol\theta^{(t+1)} = \underset{\boldsymbol\theta}{\operatorname{arg\,max}} Q(\boldsymbol\theta|\boldsymbol\theta^{(t)}) \]
Evaluation of clustering results is sometimes referred to as cluster validation.
Cluster Cohesion: measures how closely related the objects in a cluster are.
Cluster Separation: measures how distinct or well-separated a cluster is from the other clusters.
The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
There are a number of packages for plotting clusters.
We’ll focus on dendrograms, multidimensional scaling (MDS) and plotting a confusion matrix.
A dendrogram is a tree data structure which illustrates hierarchical (or taxonomic) relationships.
A dendrogram
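A dendrogram can be produced with base R’s hclust() and plot(); for example, on the wholesale spending data loaded above (dropping the Region label column):

```r
# Hierarchical clustering of the spending data, drawn as a dendrogram
spend.d  <- dist(spend[, -1])                    # Euclidean distances, Region dropped
spend.hc <- hclust(spend.d, method = "complete")
plot(spend.hc, labels = spend$Region, cex = 0.5)
```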
Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a high-dimensional dataset.
Given a dataset of I objects (colors, faces, stocks, . . .) on which a distance function is defined, \(\delta_{i,j} :=\) distance between \(i-th\) and \(j-th\) objects. These distances are the entries of the dissimilarity matrix \[ \Delta := \begin{pmatrix} \delta_{1,1} & \delta_{1,2} & \cdots & \delta_{1,I} \\ \delta_{2,1} & \delta_{2,2} & \cdots & \delta_{2,I} \\ \vdots & \vdots & & \vdots \\ \delta_{I,1} & \delta_{I,2} & \cdots & \delta_{I,I} \end{pmatrix}. \]
The goal of MDS is, given \(\Delta\), to find I vectors \(x_1,\ldots,x_I \in \mathbb{R}^N\) such that \(\|x_i - x_j\| \approx \delta_{i,j}\) for all \(i,j\in {1,\dots,I}\), where \(\|\cdot\|\) is a vector norm. In classical MDS, this norm is the Euclidean distance, but it may be any metric or arbitrary distance function.
In other words, MDS attempts to find an embedding from the I objects into \(\mathbb{R}^N\) such that distances are preserved.
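Classical MDS is available in base R as cmdscale(); a sketch using the World Bank data prepared above (the columns are scaled first so that no single variable dominates the distances):

```r
# Embed the countries in 2-D so that pairwise distances are (approximately) preserved
wb.mds <- cmdscale(dist(scale(wb)), k = 2)
plot(wb.mds, type = "n")
text(wb.mds, labels = wb.Country, cex = 0.6)
```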
An example of classical multidimensional scaling applied to voting patterns in the United States House of Representatives
A confusion plot is a plot of the confusion matrix. A confusion matrix is also known as a contingency table or an error matrix.
A confusion matrix is a 2x2 table with counts of the following in each of its quadrants:
true positive (TP): eqv. with hit
true negative (TN): eqv. with correct rejection
false positive (FP): eqv. with false alarm, Type I error
false negative (FN): eqv. with miss, Type II error
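Given vectors of actual and predicted labels, these four counts fall out of table(); a sketch with made-up labels:

```r
# TP, FN, FP, TN from a 2x2 table of actual vs. predicted labels
actual    <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
predicted <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
table(actual, predicted)  # rows: actual, columns: predicted
```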
Cluster evaluation metrics quantify within-cluster cohesion and between-cluster separation.
One can compute the correlation between two matrices (a proximity matrix and an “incidence” matrix). Since the matrices are symmetric, only the correlation between \(\frac{n(n-1)}{2}\) entries needs to be calculated (i.e. above or below the diagonal).
High correlation indicates that points that belong to the same cluster are close to each other. This is not a good measure for some density or contiguity based clusters.
Proximity Matrix
A proximity is a measurement of the similarity or dissimilarity, broadly defined, of a pair of objects. If measured for all pairs of objects in a set (e.g. driving distances among a set of U.S. cities), the proximities are represented by an object-by-object proximity matrix.
“Incidence” Matrix
An “incidence” matrix
One can sort the similarity matrix with respect to cluster labels and then plot it.
The Davies–Bouldin index can be represented by the formula below:
\[ DB = \frac {1} {n} \sum_{i=1}^{n} \max_{j\neq i}\left(\frac{\sigma_i + \sigma_j} {d(c_i,c_j)}\right) \]
where \(n\) is the number of clusters, \(c_x\) is the centroid of cluster \(x\), \(\sigma_x\) is the average distance of all elements in cluster \(x\) to centroid \(c_x\), and \(d(c_i,c_j)\) is the distance between centroids \(c_i\) and \(c_j\).
Clusters with low intra-cluster distances (high intra-cluster similarity) and high inter-cluster distances (low inter-cluster similarity) will have a low Davies–Bouldin index.
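The formula translates directly into base R; a sketch in which davies_bouldin is a hypothetical helper (it assumes clusters are labeled 1..n with no empty clusters):

```r
# Davies-Bouldin index computed from the formula above
davies_bouldin <- function(X, cl) {
  X <- as.matrix(X)
  k <- max(cl)
  cent  <- t(sapply(1:k, function(i) colMeans(X[cl == i, , drop = FALSE])))
  sigma <- sapply(1:k, function(i)   # average distance to the cluster's centroid
    mean(sqrt(colSums((t(X[cl == i, , drop = FALSE]) - cent[i, ])^2))))
  mean(sapply(1:k, function(i)       # average of each cluster's worst-case ratio
    max(sapply((1:k)[-i], function(j)
      (sigma[i] + sigma[j]) / sqrt(sum((cent[i, ] - cent[j, ])^2))))))
}
davies_bouldin(wb, kmeans(wb, 3)$cluster)  # lower is better
```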
The Dunn index aims to identify dense and well-separated clusters. It is defined as the ratio between the minimal inter-cluster distance to maximal intra-cluster distance. For each cluster partition, the Dunn index can be calculated by the following formula:
\[ D = \frac{\min_{1 \leq i < j \leq n} d(i,j)}{\max_{1 \leq k \leq n} d^{\prime}(k)} \, \]
where \(d(i,j)\) represents the distance between clusters \(i\) and \(j\), and \(d '(k)\) measures the intra-cluster distance of cluster \(k\). The inter-cluster distance \(d(i,j)\) between two clusters may be any number of distance measures, such as the distance between the centroids of the clusters.
Clusters with high intra-cluster similarity and low inter-cluster similarity, produce clusters with high Dunn index (i.e. are more desirable).
The silhouette coefficient contrasts the average distance to elements in the same cluster with the average distance to elements in other clusters. Objects with a high silhouette value are considered well clustered, objects with a low value may be outliers. This index works well with k-means clustering, and is also used to determine the optimal number of clusters.
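The cluster package loaded above computes silhouettes directly; for example, with a 3-cluster k-means of the World Bank data:

```r
# Average silhouette width; values near 1 indicate well-clustered objects
km  <- kmeans(wb, 3)
sil <- silhouette(km$cluster, dist(wb))
summary(sil)$avg.width
plot(sil)
```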
A simple measure is purity, the ratio between the count of the dominant class in cluster \(c_i\) and the size of cluster \(c_i\). To compute purity, each cluster is assigned to the class which is most frequent in the cluster, and then the accuracy of this assignment is measured by counting the number of correctly assigned documents and dividing by \(N\). Formally:
\[ \mbox{purity}(\mathbb{C_i}) = \frac{1}{n_i} \sum_k \max_j (n_{i,j}) \quad j \in \mathbb{C} \]
Note that this is biased because having n clusters for n data objects maximizes purity.
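Purity is a one-line computation from the cluster-vs-class table; a sketch with made-up labels (purity is a hypothetical helper):

```r
# Purity: dominant-class count in each cluster, summed and divided by N
purity <- function(cluster, class) {
  tab <- table(cluster, class)
  sum(apply(tab, 1, max)) / length(cluster)
}
purity(c(1, 1, 1, 2, 2, 2), c("a", "a", "b", "b", "b", "b"))  # (2 + 3) / 6
```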
The notion of entropy can be used to evaluate the “stickiness,” or mutual information, between classes and clusters: it can measure how much knowing one of these variables reduces uncertainty about the other. In a sense, we want to measure how “sticky” the within-cluster data is and how “non-sticky” the between-cluster data is. There are several ways to use entropy to measure “stickiness.”
Mutual information quantifies the mutual dependence of two random variables. It is a measure of the “stickiness” between two items: it measures how much knowing one of these variables reduces uncertainty about the other. We can use mutual information to quantify the association between class labels and cluster assignments. Mutual information is given by:
\[ I(X;Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \log\!\left(\frac{p(x,y)}{p(x)\,p(y)}\right) \]
where \(p(x,y)\) is the joint probability distribution function of \(X\) and \(Y\), and \(p(x)\) and \(p(y)\) are the marginal probability distribution functions of \(X\) and \(Y\) respectively.
Kullback-Leibler divergence is a non-symmetric measure of the difference between two probability distributions. The Kullback-Leibler measure goes by several names: relative entropy, discrimination information, Kullback-Leibler (KL) number, directed divergence, informational divergence, and cross entropy. Kullback-Leibler divergence is a measure of the difference between the observed entropy and its expected entropy. We calculate the KL divergence by weighting one distribution (like an observed frequency distribution) by the log of probabilities of some other distribution. For discrete probability distributions P and Q, the Kullback–Leibler divergence of Q from P is defined to be:
\[ D_{\mathrm{KL}}(P\|Q) = \sum_i P(i) \, \ln\frac{P(i)}{Q(i)} \]
In words, it is the expectation of the logarithmic difference between the probabilities P and Q, where the expectation is taken using the probabilities P.
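For discrete distributions the definition is a one-liner; a sketch with made-up distributions (kl_div is a hypothetical helper):

```r
# KL divergence of Q from P: expectation under P of log(P/Q)
kl_div <- function(p, q) sum(p * log(p / q))
p <- c(0.5, 0.3, 0.2)
q <- c(0.4, 0.4, 0.2)
kl_div(p, q)   # positive; kl_div(p, p) is 0
```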
The Jaccard index, also known as the Jaccard similarity coefficient (originally coined coefficient de communauté by Paul Jaccard), is a statistic used for comparing the similarity and diversity of sample sets. If \(\mathbf{x} = (x_1, x_2, \ldots, x_n)\) and \(\mathbf{y} = (y_1, y_2, \ldots, y_n)\) are two vectors with all real \(x_i, y_i \geq 0\), then their Jaccard similarity coefficient is defined as
\[ J(\mathbf{x}, \mathbf{y}) = \frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)}, \]
and Jaccard distance
\[ d_J(\mathbf{x}, \mathbf{y}) = 1 - J(\mathbf{x}, \mathbf{y}). \]
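The min/max form of the coefficient maps directly onto pmin() and pmax(); a sketch with made-up vectors (jaccard is a hypothetical helper):

```r
# Jaccard similarity and distance for nonnegative vectors
jaccard <- function(x, y) sum(pmin(x, y)) / sum(pmax(x, y))
x <- c(1, 2, 0, 3)
y <- c(2, 2, 1, 1)
jaccard(x, y)      # 4/8 = 0.5
1 - jaccard(x, y)  # Jaccard distance
```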
The Rand index or Rand measure (named after William M. Rand), in statistics and in particular in data clustering, is a measure of the similarity between two data clusterings. A form of the Rand index that is adjusted for the chance grouping of elements is the adjusted Rand index.
Given a set of \(n\) elements \(S = \{o_1, \ldots, o_n\}\) and two partitions of \(S\) to compare, \(X = \{X_1, \ldots, X_r\}\), a partition of \(S\) into \(r\) subsets, and \(Y = \{Y_1, \ldots, Y_s\}\), a partition of \(S\) into \(s\) subsets, define the following: \(a\), the number of pairs of elements that are in the same subset in \(X\) and in the same subset in \(Y\); \(b\), the number of pairs that are in different subsets in \(X\) and in different subsets in \(Y\); \(c\), the number of pairs that are in the same subset in \(X\) but in different subsets in \(Y\); and \(d\), the number of pairs that are in different subsets in \(X\) but in the same subset in \(Y\).
The Rand index, R, is:
\[ RI = \frac{a+b}{a+b+c+d} = \frac{a+b}{{n \choose 2 }} \]
Intuitively, \(a + b\) can be considered as the number of agreements between \(X\) and \(Y\) and \(c + d\) as the number of disagreements between \(X\) and \(Y\).
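Counting agreeing pairs directly gives a small \(O(n^2)\) implementation; rand_index below is a hypothetical helper, shown with made-up partitions:

```r
# Rand index: fraction of element pairs on which the two partitions agree
rand_index <- function(x, y) {
  n <- length(x)
  agree <- 0
  for (i in 1:(n - 1)) for (j in (i + 1):n)
    agree <- agree + ((x[i] == x[j]) == (y[i] == y[j]))
  agree / choose(n, 2)
}
rand_index(c(1, 1, 2, 2), c(1, 1, 1, 2))  # 3 of 6 pairs agree: 0.5
```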
Rand Index
The Rand index can be used to create a cluster F-measure.
\[ RI = \frac{a+b}{a+b+c+d} \]
Compare with standard Precision and Recall:
\[ Precision = \frac{a}{a+c} \]
\[ Recall = \frac{a}{a+d} \]
Compare with a confusion matrix:
A confusion matrix is a 2x2 table with counts of the following in each of its quadrants:
true positive (TP): eqv. with hit
true negative (TN): eqv. with correct rejection
false positive (FP): eqv. with false alarm, Type I error
false negative (FN): eqv. with miss, Type II error
K-means clustering in R. Can we cluster gender by height using the Galton regression data?
qplot(x=Height, data=h.gend, geom="density", group=Gender, color=Gender)
k<-2
galton.2.clust<-kmeans(h.gend[,c("Height")],k)
galton.2.clust
## K-means clustering with 2 clusters of sizes 506, 392
##
## Cluster means:
## [,1]
## 1 64.13775
## 2 70.14643
##
## Clustering vector:
## [1] 2 2 2 2 2 2 1 1 2 2 2 2 1 1 1 2 2 2 1 1 1 2 2 2 2 2 2 1 2 2 1 1 1 2 2
## [36] 2 1 1 1 1 1 1 2 1 2 2 1 2 2 2 2 2 2 1 1 1 2 2 2 1 1 1 1 1 1 1 2 2 2 2
## [71] 2 2 2 1 2 2 2 2 2 1 2 2 2 1 1 1 1 1 1 1 2 2 2 1 1 2 1 1 2 1 2 2 1 1 2
## [106] 2 1 1 2 2 2 1 1 1 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 1 2 2 2 1 2 1 1 1 2 2
## [141] 2 1 1 1 2 1 2 2 2 1 1 2 2 2 2 2 1 1 2 1 2 2 2 2 1 2 2 2 1 1 1 1 1 2 2
## [176] 2 1 2 2 2 2 2 2 1 1 1 1 2 1 2 1 1 1 1 2 2 2 1 1 1 1 1 1 2 2 2 1 2 2 1
## [211] 1 1 2 2 2 1 1 2 1 1 1 1 2 2 2 2 1 1 1 1 2 2 2 2 2 2 2 2 2 1 1 1 2 2 2
## [246] 1 1 1 2 2 2 2 1 2 2 1 1 1 1 1 2 2 2 2 2 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1
## [281] 1 1 2 2 1 1 1 1 2 2 2 2 1 1 1 2 2 1 2 2 2 2 2 1 1 1 1 2 2 2 1 1 1 1 2
## [316] 2 1 1 2 2 1 1 1 2 2 2 2 2 2 1 1 1 2 1 1 1 2 2 2 2 2 2 2 1 1 2 1 1 1 1
## [351] 1 1 1 2 2 1 1 1 2 2 2 1 2 1 1 1 2 1 1 1 2 2 2 2 2 1 1 1 2 2 2 1 1 1 1
## [386] 2 2 1 2 1 1 1 1 1 1 1 2 1 1 2 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 2 2 2 2 1
## [421] 1 1 1 2 2 2 2 2 2 2 1 1 2 2 2 1 1 1 2 2 2 2 1 2 2 1 1 2 2 2 1 1 1 2 2
## [456] 2 2 1 1 1 2 2 2 2 1 1 1 1 1 2 2 1 1 1 1 1 2 2 1 1 1 1 1 2 1 1 1 2 2 1
## [491] 2 2 2 2 1 1 1 2 1 1 1 1 1 1 2 1 1 1 2 2 2 2 2 2 2 1 2 2 2 2 1 1 1 1 2
## [526] 2 1 1 2 2 2 1 1 2 2 2 1 1 1 1 1 1 2 2 1 2 1 1 1 2 2 1 2 2 1 2 2 2 2 2
## [561] 2 2 2 2 1 2 1 1 1 2 2 2 1 1 1 1 2 2 2 1 2 2 1 1 1 1 1 1 2 2 2 1 1 1 1
## [596] 1 1 1 1 1 1 1 2 2 2 2 1 1 2 1 1 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 2 2 2 1
## [631] 1 2 2 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 1 1 2 2 1 1 2 2 1 1 1
## [666] 1 1 1 1 1 1 1 1 2 1 1 1 2 2 1 1 1 1 1 1 1 1 1 2 2 2 1 1 2 2 2 1 1 1 1
## [701] 2 2 1 1 1 2 2 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 2 2 1 1 2 2 2 2 2 1 1 1
## [736] 2 2 1 2 2 2 2 1 1 2 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1
## [771] 1 1 2 2 1 1 1 1 1 1 2 2 1 1 1 2 1 1 2 1 1 1 1 1 2 2 1 1 1 1 1 2 2 1 1
## [806] 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 1
## [841] 2 1 2 2 2 2 1 1 1 1 1 1 1 1 2 1 1 1 1 2 2 2 1 2 2 1 1 1 2 2 1 1 1 1 1
## [876] 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 2133.809 1406.475
## (between_SS / total_SS = 69.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
Create the confusion matrix and plot it. Note we can only do this because we know what the gender actually is. Note also that running it twice can give different results.
cm<-table(h.gend$Gender,galton.2.clust$cluster) # Confusion matrix
cm
##
## 1 2
## F 393 40
## M 113 352
plot(cm)
galton.2.clust<-kmeans(h.gend[,c("Height")],k)
galton.2.clust
## K-means clustering with 2 clusters of sizes 455, 443
##
## Cluster means:
## [,1]
## 1 63.81692
## 2 69.78420
##
## Clustering vector:
## [1] 2 2 2 2 2 2 1 1 2 2 2 2 2 1 1 2 2 2 1 1 1 2 2 2 2 2 2 1 2 2 1 1 1 2 2
## [36] 2 2 2 1 1 1 1 2 2 2 2 1 2 2 2 2 2 2 1 1 1 2 2 2 1 1 1 1 1 1 1 2 2 2 2
## [71] 2 2 2 1 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 2 2 2 1 1 2 2 1 2 2 2 2 1 1 2
## [106] 2 1 1 2 2 2 1 1 1 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2
## [141] 2 1 1 1 2 2 2 2 2 1 1 2 2 2 2 2 1 1 2 1 2 2 2 2 1 2 2 2 1 1 1 1 1 2 2
## [176] 2 2 2 2 2 2 2 2 1 1 1 1 2 1 2 2 1 1 1 2 2 2 1 1 1 1 1 1 2 2 2 1 2 2 1
## [211] 1 1 2 2 2 1 1 2 1 1 1 1 2 2 2 2 1 1 1 1 2 2 2 2 2 2 2 2 2 1 1 1 2 2 2
## [246] 1 1 1 2 2 2 2 1 2 2 1 1 1 1 1 2 2 2 2 2 1 1 1 1 2 2 2 1 1 1 1 1 2 1 1
## [281] 1 1 2 2 2 1 1 1 2 2 2 2 2 1 1 2 2 1 2 2 2 2 2 1 1 1 1 2 2 2 2 1 1 2 2
## [316] 2 1 1 2 2 1 1 1 2 2 2 2 2 2 1 1 1 2 1 1 1 2 2 2 2 2 2 2 1 1 2 2 1 1 1
## [351] 1 1 1 2 2 2 1 1 2 2 2 1 2 2 1 1 2 1 1 1 2 2 2 2 2 1 1 1 2 2 2 1 1 1 1
## [386] 2 2 1 2 2 2 1 1 1 1 1 2 1 1 2 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 2 2 2 2 1
## [421] 1 1 1 2 2 2 2 2 2 2 2 1 2 2 2 1 1 1 2 2 2 2 1 2 2 1 1 2 2 2 1 1 1 2 2
## [456] 2 2 1 1 1 2 2 2 2 1 1 1 1 1 2 2 2 1 1 1 1 2 2 1 1 1 1 1 2 1 1 1 2 2 1
## [491] 2 2 2 2 1 1 1 2 2 1 1 1 1 1 2 1 1 1 2 2 2 2 2 2 2 1 2 2 2 2 1 1 1 1 2
## [526] 2 1 1 2 2 2 1 1 2 2 2 1 1 1 1 1 1 2 2 1 2 1 1 1 2 2 1 2 2 2 2 2 2 2 2
## [561] 2 2 2 2 1 2 1 1 1 2 2 2 1 1 1 1 2 2 2 1 2 2 1 1 1 1 1 1 2 2 2 2 1 1 1
## [596] 1 1 1 1 1 1 1 2 2 2 2 1 1 2 2 1 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 2 2 2 1
## [631] 2 2 2 1 1 2 2 1 1 1 1 1 1 2 2 1 1 1 1 2 2 2 2 2 1 1 2 2 1 1 2 2 1 1 1
## [666] 1 1 1 1 1 1 1 1 2 2 1 1 2 2 1 1 1 1 1 1 1 1 1 2 2 2 1 1 2 2 2 1 1 1 1
## [701] 2 2 1 1 1 2 2 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 1 1
## [736] 2 2 1 2 2 2 2 1 2 2 2 2 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 2 1 1 2 2 1 1
## [771] 1 1 2 2 1 1 1 1 1 1 2 2 2 1 1 2 1 1 2 2 1 1 1 1 2 2 1 1 1 1 1 2 2 2 1
## [806] 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 1
## [841] 2 1 2 2 2 2 1 1 2 2 1 1 1 1 2 1 1 1 1 2 2 2 1 2 2 2 2 1 2 2 1 1 1 1 1
## [876] 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 1669.160 1853.249
## (between_SS / total_SS = 69.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
cm2<-table(h.gend$Gender,galton.2.clust$cluster) # Confusion matrix
cm2
##
## 1 2
## F 376 57
## M 79 386
plot(cm2)
Clustering the World Bank data.
head(wb)
## Life.expectancy Infant.mortality Per.capita.income Literacy
## 6 52.26688 96.0 4102.119 70.77841
## 7 75.93763 5.8 14128.879 98.95000
## 13 70.76322 27.9 5496.345 99.78936
## 16 71.62590 30.7 1211.702 59.72154
## 25 76.43324 5.1 4197.807 98.26413
## 26 64.42924 34.8 6360.645 87.32057
k<-2
wb.2.clust<- kmeans(wb,k)
wb.2.clust
## K-means clustering with 2 clusters of sizes 43, 2
##
## Cluster means:
## Life.expectancy Infant.mortality Per.capita.income Literacy
## 1 70.92308 25.0814 8161.671 87.89852
## 2 80.62157 4.4500 63777.971 97.00900
##
## Clustering vector:
## 6 7 13 16 25 26 27 39 57 58 59 60 61 72 76 80 82 85
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 87 92 96 97 106 113 118 128 130 136 156 159 166 167 170 171 174 179
## 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1
## 185 191 197 199 200 204 208 209 214
## 1 1 1 1 1 1 1 1 1
##
## Within cluster sum of squares by cluster:
## [1] 2659103941 237150514
## (between_SS / total_SS = 67.1 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
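One caveat before going further: k-means is distance-based, so the unscaled Per.capita.income column (in dollars) dominates the within-cluster sums of squares above, which is why they run into the billions. Standardizing the columns first often gives more balanced clusters. A sketch on synthetic stand-in data (the wb data frame itself is not reconstructed here):

```r
# k-means uses Euclidean distance, so a variable on a large scale (income,
# in dollars) swamps one on a small scale (life expectancy, in years)
# unless the columns are standardized first.
set.seed(111)
df <- data.frame(life.exp = rnorm(40, mean = 70, sd = 5),
                 income   = rnorm(40, mean = 10000, sd = 8000))
df.scaled  <- scale(df)                  # each column: mean 0, sd 1
fit.scaled <- kmeans(df.scaled, centers = 2)
fit.scaled$centers                       # centers in standard-deviation units
```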
k<-3
wb.3.clust<- kmeans(wb,k)
wb.3.clust
## K-means clustering with 3 clusters of sizes 2, 11, 32
##
## Cluster means:
## Life.expectancy Infant.mortality Per.capita.income Literacy
## 1 80.62157 4.450000 63777.97 97.00900
## 2 77.54424 6.763636 19890.37 98.12095
## 3 68.64705 31.378125 4129.93 84.38456
##
## Clustering vector:
## 6 7 13 16 25 26 27 39 57 58 59 60 61 72 76 80 82 85
## 3 2 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3
## 87 92 96 97 106 113 118 128 130 136 156 159 166 167 170 171 174 179
## 2 3 2 3 2 3 3 3 3 3 2 1 2 3 3 1 2 2
## 185 191 197 199 200 204 208 209 214
## 3 3 2 3 3 3 2 3 3
##
## Within cluster sum of squares by cluster:
## [1] 237150514 386675339 239076979
## (between_SS / total_SS = 90.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
Determining the number of clusters. There are a number of methods for determining “the right number of clusters”, including Hartigan’s rule, average silhouette width and the Gap statistic.
# Determining number of clusters
sos <- (nrow(wb)-1)*sum(apply(wb,2,var))
for (i in 2:10) sos[i] <- sum(kmeans(wb, centers=i)$withinss)
plot(1:10, sos, type="b", xlab="Number of Clusters", ylab="sum of squares")
plot(2:10, sos[c(2:10)], type="b", xlab="Number of Clusters", ylab="sum of squares")
# Hartigan's rule via FitKMeans (from the useful package)
# require(useful)
best<-FitKMeans(wb,max.clusters=10, seed=111)
PlotHartigan(best)
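Hartigan's rule can also be computed directly: H(k) = (W(k)/W(k+1) - 1) * (n - k - 1), where W(k) is the total within-cluster sum of squares for k clusters; values above 10 suggest moving from k to k+1. A minimal sketch, assuming a numeric data frame x:

```r
# Hartigan's statistic for the move from k to k+1 clusters;
# values above 10 favor adding the extra cluster.
hartigan <- function(x, k, seed = 111) {
  set.seed(seed)
  wk  <- sum(kmeans(x, centers = k)$withinss)      # W(k)
  wk1 <- sum(kmeans(x, centers = k + 1)$withinss)  # W(k+1)
  (wk / wk1 - 1) * (nrow(x) - k - 1)
}
```

FitKMeans() computes the same quantity over a range of k, and PlotHartigan() plots it against the threshold of 10.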
A k of 6?
head(wb)
## Life.expectancy Infant.mortality Per.capita.income Literacy
## 6 52.26688 96.0 4102.119 70.77841
## 7 75.93763 5.8 14128.879 98.95000
## 13 70.76322 27.9 5496.345 99.78936
## 16 71.62590 30.7 1211.702 59.72154
## 25 76.43324 5.1 4197.807 98.26413
## 26 64.42924 34.8 6360.645 87.32057
k<-6
wb.6.clust<- kmeans(wb,k)
wb.6.clust
## K-means clustering with 6 clusters of sizes 6, 15, 4, 2, 5, 13
##
## Cluster means:
## Life.expectancy Infant.mortality Per.capita.income Literacy
## 1 77.60843 7.75000 24383.696 97.62204
## 2 70.74951 25.16000 4968.196 88.90804
## 3 70.98357 26.42500 9449.684 93.86932
## 4 80.62157 4.45000 63777.971 97.00900
## 5 77.46720 5.58000 14498.383 98.71964
## 6 65.50221 40.07692 1525.854 76.24678
##
## Clustering vector:
## 6 7 13 16 25 26 27 39 57 58 59 60 61 72 76 80 82 85
## 2 5 2 6 2 2 3 6 2 2 2 2 3 2 5 2 6 6
## 87 92 96 97 106 113 118 128 130 136 156 159 166 167 170 171 174 179
## 5 2 1 2 1 2 2 3 6 6 5 4 1 6 6 4 1 1
## 185 191 197 199 200 204 208 209 214
## 6 6 1 3 2 6 5 6 6
##
## Within cluster sum of squares by cluster:
## [1] 97332955 16265503 3921650 237150514 22834895 6991191
## (between_SS / total_SS = 95.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
trials<-33
wb.6.clust.33<- kmeans(wb,k, nstart = trials)
wb.6.clust.33
## K-means clustering with 6 clusters of sizes 7, 13, 15, 3, 2, 5
##
## Cluster means:
## Life.expectancy Infant.mortality Per.capita.income Literacy
## 1 73.28554 17.328571 10954.456 96.17687
## 2 65.50221 40.076923 1525.854 76.24678
## 3 70.74951 25.160000 4968.196 88.90804
## 4 80.11777 4.566667 28221.092 97.57849
## 5 80.62157 4.450000 63777.971 97.00900
## 6 76.71376 9.020000 19049.671 97.76685
##
## Clustering vector:
## 6 7 13 16 25 26 27 39 57 58 59 60 61 72 76 80 82 85
## 3 1 3 2 3 3 1 2 3 3 3 3 1 3 6 3 2 2
## 87 92 96 97 106 113 118 128 130 136 156 159 166 167 170 171 174 179
## 1 3 4 3 4 3 3 1 2 2 1 5 6 2 2 5 6 4
## 185 191 197 199 200 204 208 209 214
## 2 2 6 1 3 2 6 2 2
##
## Within cluster sum of squares by cluster:
## [1] 27130607 6991191 16265503 8936570 237150514 19871802
## (between_SS / total_SS = 96.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
length(wb.Country)
## [1] 45
length(wb.6.clust.33$cluster)
## [1] 45
cm<-table(wb.Country,wb.6.clust.33$cluster) # Confusion Matrix
cm
##
## wb.Country 1 2 3 4 5 6
## Afghanistan 0 0 0 0 0 0
## Albania 0 0 0 0 0 0
## Algeria 0 0 0 0 0 0
## American Samoa 0 0 0 0 0 0
## Andorra 0 0 0 0 0 0
## Angola 0 0 1 0 0 0
## Antigua and Barbuda 1 0 0 0 0 0
## Argentina 0 0 0 0 0 0
## Armenia 0 0 0 0 0 0
## Aruba 0 0 0 0 0 0
## Australia 0 0 0 0 0 0
## Austria 0 0 0 0 0 0
## Azerbaijan 0 0 1 0 0 0
## Bahamas, The 0 0 0 0 0 0
## Bahrain 0 0 0 0 0 0
## Bangladesh 0 1 0 0 0 0
## Barbados 0 0 0 0 0 0
## Belarus 0 0 0 0 0 0
## Belgium 0 0 0 0 0 0
## Belize 0 0 0 0 0 0
## Benin 0 0 0 0 0 0
## Bermuda 0 0 0 0 0 0
## Bhutan 0 0 0 0 0 0
## Bolivia 0 0 0 0 0 0
## Bosnia and Herzegovina 0 0 1 0 0 0
## Botswana 0 0 1 0 0 0
## Brazil 1 0 0 0 0 0
## British Virgin Islands 0 0 0 0 0 0
## Brunei Darussalam 0 0 0 0 0 0
## Bulgaria 0 0 0 0 0 0
## Burkina Faso 0 0 0 0 0 0
## Burundi 0 0 0 0 0 0
## Cabo Verde 0 0 0 0 0 0
## Cambodia 0 0 0 0 0 0
## Cameroon 0 0 0 0 0 0
## Canada 0 0 0 0 0 0
## Cayman Islands 0 0 0 0 0 0
## Central African Republic 0 0 0 0 0 0
## Chad 0 1 0 0 0 0
## Channel Islands 0 0 0 0 0 0
## Chile 0 0 0 0 0 0
## China 0 0 0 0 0 0
## Colombia 0 0 0 0 0 0
## Comoros 0 0 0 0 0 0
## Congo, Dem. Rep. 0 0 0 0 0 0
## Congo, Rep. 0 0 0 0 0 0
## Costa Rica 0 0 0 0 0 0
## Cote d'Ivoire 0 0 0 0 0 0
## Croatia 0 0 0 0 0 0
## Cuba 0 0 0 0 0 0
## Curacao 0 0 0 0 0 0
## Cyprus 0 0 0 0 0 0
## Czech Republic 0 0 0 0 0 0
## Denmark 0 0 0 0 0 0
## Djibouti 0 0 0 0 0 0
## Dominica 0 0 0 0 0 0
## Dominican Republic 0 0 1 0 0 0
## Ecuador 0 0 1 0 0 0
## Egypt, Arab Rep. 0 0 1 0 0 0
## El Salvador 0 0 1 0 0 0
## Equatorial Guinea 1 0 0 0 0 0
## Eritrea 0 0 0 0 0 0
## Estonia 0 0 0 0 0 0
## Ethiopia 0 0 0 0 0 0
## Faroe Islands 0 0 0 0 0 0
## Fiji 0 0 0 0 0 0
## Finland 0 0 0 0 0 0
## France 0 0 0 0 0 0
## French Polynesia 0 0 0 0 0 0
## Gabon 0 0 0 0 0 0
## Gambia, The 0 0 0 0 0 0
## Georgia 0 0 1 0 0 0
## Germany 0 0 0 0 0 0
## Ghana 0 0 0 0 0 0
## Gibraltar 0 0 0 0 0 0
## Greece 0 0 0 0 0 1
## Greenland 0 0 0 0 0 0
## Grenada 0 0 0 0 0 0
## Guam 0 0 0 0 0 0
## Guatemala 0 0 1 0 0 0
## Guinea 0 0 0 0 0 0
## Guinea-Bissau 0 1 0 0 0 0
## Guyana 0 0 0 0 0 0
## Haiti 0 0 0 0 0 0
## Honduras 0 1 0 0 0 0
## Hong Kong SAR, China 0 0 0 0 0 0
## Hungary 1 0 0 0 0 0
## Iceland 0 0 0 0 0 0
## India 0 0 0 0 0 0
## Indonesia 0 0 0 0 0 0
## Iran, Islamic Rep. 0 0 0 0 0 0
## Iraq 0 0 1 0 0 0
## Ireland 0 0 0 0 0 0
## Isle of Man 0 0 0 0 0 0
## Israel 0 0 0 0 0 0
## Italy 0 0 0 1 0 0
## Jamaica 0 0 1 0 0 0
## Japan 0 0 0 0 0 0
## Jordan 0 0 0 0 0 0
## Kazakhstan 0 0 0 0 0 0
## Kenya 0 0 0 0 0 0
## Kiribati 0 0 0 0 0 0
## Korea, Dem. People's Rep. 0 0 0 0 0 0
## Korea, Rep. 0 0 0 0 0 0
## Kosovo 0 0 0 0 0 0
## Kuwait 0 0 0 1 0 0
## Kyrgyz Republic 0 0 0 0 0 0
## Lao PDR 0 0 0 0 0 0
## Latvia 0 0 0 0 0 0
## Lebanon 0 0 0 0 0 0
## Lesotho 0 0 0 0 0 0
## Liberia 0 0 0 0 0 0
## Libya 0 0 1 0 0 0
## Liechtenstein 0 0 0 0 0 0
## Lithuania 0 0 0 0 0 0
## Luxembourg 0 0 0 0 0 0
## Macao SAR, China 0 0 0 0 0 0
## Macedonia, FYR 0 0 1 0 0 0
## Madagascar 0 0 0 0 0 0
## Malawi 0 0 0 0 0 0
## Malaysia 0 0 0 0 0 0
## Maldives 0 0 0 0 0 0
## Mali 0 0 0 0 0 0
## Malta 0 0 0 0 0 0
## Marshall Islands 0 0 0 0 0 0
## Mauritania 0 0 0 0 0 0
## Mauritius 0 0 0 0 0 0
## Mexico 1 0 0 0 0 0
## Micronesia, Fed. Sts. 0 0 0 0 0 0
## Moldova 0 1 0 0 0 0
## Monaco 0 0 0 0 0 0
## Mongolia 0 0 0 0 0 0
## Montenegro 0 0 0 0 0 0
## Morocco 0 0 0 0 0 0
## Mozambique 0 0 0 0 0 0
## Myanmar 0 1 0 0 0 0
## Namibia 0 0 0 0 0 0
## Nauru 0 0 0 0 0 0
## Nepal 0 0 0 0 0 0
## Netherlands 0 0 0 0 0 0
## New Caledonia 0 0 0 0 0 0
## New Zealand 0 0 0 0 0 0
## Nicaragua 0 0 0 0 0 0
## Niger 0 0 0 0 0 0
## Nigeria 0 0 0 0 0 0
## Northern Mariana Islands 0 0 0 0 0 0
## Norway 0 0 0 0 0 0
## Oman 0 0 0 0 0 0
## Pakistan 0 0 0 0 0 0
## Palau 0 0 0 0 0 0
## Panama 0 0 0 0 0 0
## Papua New Guinea 0 0 0 0 0 0
## Paraguay 0 0 0 0 0 0
## Peru 0 0 0 0 0 0
## Philippines 0 0 0 0 0 0
## Poland 1 0 0 0 0 0
## Portugal 0 0 0 0 0 0
## Puerto Rico 0 0 0 0 0 0
## Qatar 0 0 0 0 1 0
## Romania 0 0 0 0 0 0
## Russian Federation 0 0 0 0 0 0
## Rwanda 0 0 0 0 0 0
## Samoa 0 0 0 0 0 0
## San Marino 0 0 0 0 0 0
## Sao Tome and Principe 0 0 0 0 0 0
## Saudi Arabia 0 0 0 0 0 1
## Senegal 0 1 0 0 0 0
## Serbia 0 0 0 0 0 0
## Seychelles 0 0 0 0 0 0
## Sierra Leone 0 1 0 0 0 0
## Singapore 0 0 0 0 1 0
## Sint Maarten (Dutch part) 0 0 0 0 0 0
## Slovak Republic 0 0 0 0 0 0
## Slovenia 0 0 0 0 0 1
## Solomon Islands 0 0 0 0 0 0
## Somalia 0 0 0 0 0 0
## South Africa 0 0 0 0 0 0
## South Sudan 0 0 0 0 0 0
## Spain 0 0 0 1 0 0
## Sri Lanka 0 0 0 0 0 0
## St. Kitts and Nevis 0 0 0 0 0 0
## St. Lucia 0 0 0 0 0 0
## St. Martin (French part) 0 0 0 0 0 0
## St. Vincent and the Grenadines 0 0 0 0 0 0
## Sudan 0 1 0 0 0 0
## Suriname 0 0 0 0 0 0
## Swaziland 0 0 0 0 0 0
## Sweden 0 0 0 0 0 0
## Switzerland 0 0 0 0 0 0
## Syrian Arab Republic 0 0 0 0 0 0
## Tajikistan 0 1 0 0 0 0
## Tanzania 0 0 0 0 0 0
## Thailand 0 0 0 0 0 0
## Timor-Leste 0 0 0 0 0 0
## Togo 0 0 0 0 0 0
## Tonga 0 0 0 0 0 0
## Trinidad and Tobago 0 0 0 0 0 1
## Tunisia 0 0 0 0 0 0
## Turkey 1 0 0 0 0 0
## Turkmenistan 0 0 1 0 0 0
## Turks and Caicos Islands 0 0 0 0 0 0
## Tuvalu 0 0 0 0 0 0
## Uganda 0 0 0 0 0 0
## Ukraine 0 1 0 0 0 0
## United Arab Emirates 0 0 0 0 0 0
## United Kingdom 0 0 0 0 0 0
## United States 0 0 0 0 0 0
## Uruguay 0 0 0 0 0 1
## Uzbekistan 0 1 0 0 0 0
## Vanuatu 0 0 0 0 0 0
## Venezuela, RB 0 0 0 0 0 0
## Vietnam 0 0 0 0 0 0
## Virgin Islands (U.S.) 0 0 0 0 0 0
## West Bank and Gaza 0 1 0 0 0 0
## Yemen, Rep. 0 0 0 0 0 0
## Zambia 0 0 0 0 0 0
## Zimbabwe 0 0 0 0 0 0
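The sparse table above is hard to read; listing the member countries of each cluster is usually more informative, e.g. via split(wb.Country, wb.6.clust.33$cluster). A minimal sketch with small hypothetical stand-in vectors:

```r
# split() groups the country names by their assigned cluster label,
# returning one character vector per cluster.
country <- c("Albania", "Angola", "Austria", "Armenia")
cluster <- c(2, 3, 1, 2)
split(country, cluster)
```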
Multidimensional scaling (MDS)
# plot.kmeans (from the useful package) projects the clusters via MDS
# require(useful)
# plot(wb.2.clust)
# Evaluating model performance
# look at the size of the clusters
wb.6.clust$size
## [1] 6 15 4 2 5 13
# look at the cluster centers
wb.6.clust$centers
## Life.expectancy Infant.mortality Per.capita.income Literacy
## 1 77.60843 7.75000 24383.696 97.62204
## 2 70.74951 25.16000 4968.196 88.90804
## 3 70.98357 26.42500 9449.684 93.86932
## 4 80.62157 4.45000 63777.971 97.00900
## 5 77.46720 5.58000 14498.383 98.71964
## 6 65.50221 40.07692 1525.854 76.24678
names(wb)
## [1] "Life.expectancy" "Infant.mortality" "Per.capita.income"
## [4] "Literacy"
# mean of 'Per.capita.income' by cluster
pci<-aggregate(data = wb, Per.capita.income ~ wb.6.clust$cluster, mean)
pci
## wb.6.clust$cluster Per.capita.income
## 1 1 24383.696
## 2 2 4968.196
## 3 3 9449.684
## 4 4 63777.971
## 5 5 14498.383
## 6 6 1525.854
# mean of 'Literacy' by cluster
l<-aggregate(data = wb, Literacy ~ wb.6.clust$cluster, mean)
l
## wb.6.clust$cluster Literacy
## 1 1 97.62204
## 2 2 88.90804
## 3 3 93.86933
## 4 4 97.00900
## 5 5 98.71964
## 6 6 76.24678
# mean of 'Infant.mortality' by cluster
im<-aggregate(data = wb, Infant.mortality ~ wb.6.clust$cluster, mean)
im
## wb.6.clust$cluster Infant.mortality
## 1 1 7.75000
## 2 2 25.16000
## 3 3 26.42500
## 4 4 4.45000
## 5 5 5.58000
## 6 6 40.07692
# mean 'Life.expectancy' by cluster
le<-aggregate(data = wb, Life.expectancy ~ wb.6.clust$cluster, mean)
le
## wb.6.clust$cluster Life.expectancy
## 1 1 77.60843
## 2 2 70.74951
## 3 3 70.98357
## 4 4 80.62157
## 5 5 77.46720
## 6 6 65.50221
cm.3=table(wb$Per.capita.income,wb.6.clust$cluster)
cm.3
##
## 1 2 3 4 5 6
## 573.0286189 0 0 0 0 0 1
## 693.4080457 0 0 0 0 0 1
## 775.6950905 0 0 0 0 0 1
## 910.788688 0 0 0 0 0 1
## 925.9118877 0 0 0 0 0 1
## 1203.505387 0 0 0 0 0 1
## 1211.701531 0 0 0 0 0 1
## 1843.242802 0 0 0 0 0 1
## 2089.400212 0 0 0 0 0 1
## 2114.954716 0 0 0 0 0 1
## 2132.072442 0 0 0 0 0 1
## 2495.590498 0 0 0 0 0 1
## 2866.800101 0 0 0 0 0 1
## 3614.746766 0 1 0 0 0 0
## 3795.973308 0 1 0 0 0 0
## 3903.490842 0 1 0 0 0 0
## 4102.11859 0 1 0 0 0 0
## 4197.807304 0 1 0 0 0 0
## 4219.35033 0 1 0 0 0 0
## 4629.076634 0 1 0 0 0 0
## 4643.305762 0 1 0 0 0 0
## 4852.657847 0 1 0 0 0 0
## 5137.91553 0 1 0 0 0 0
## 5496.34464 0 1 0 0 0 0
## 6248.110873 0 1 0 0 0 0
## 6360.644776 0 1 0 0 0 0
## 6373.553553 0 1 0 0 0 0
## 6947.840023 0 1 0 0 0 0
## 8538.589975 0 0 1 0 0 0
## 9009.261163 0 0 1 0 0 0
## 9130.026065 0 0 1 0 0 0
## 11120.85799 0 0 1 0 0 0
## 12259.11503 0 0 0 0 1 0
## 12494.46619 0 0 0 0 1 0
## 14128.87855 0 0 0 0 1 0
## 15573.90092 0 0 0 0 1 0
## 18035.55432 0 0 0 0 1 0
## 20444.07859 1 0 0 0 0 0
## 20481.74532 1 0 0 0 0 0
## 20713.07475 1 0 0 0 0 0
## 25831.58231 1 0 0 0 0 0
## 28984.64339 1 0 0 0 0 0
## 29847.04979 1 0 0 0 0 0
## 52888.74467 0 0 0 1 0 0
## 74667.19707 0 0 0 1 0 0
print(wb.6.clust)
## K-means clustering with 6 clusters of sizes 6, 15, 4, 2, 5, 13
##
## Cluster means:
## Life.expectancy Infant.mortality Per.capita.income Literacy
## 1 77.60843 7.75000 24383.696 97.62204
## 2 70.74951 25.16000 4968.196 88.90804
## 3 70.98357 26.42500 9449.684 93.86932
## 4 80.62157 4.45000 63777.971 97.00900
## 5 77.46720 5.58000 14498.383 98.71964
## 6 65.50221 40.07692 1525.854 76.24678
##
## Clustering vector:
## 6 7 13 16 25 26 27 39 57 58 59 60 61 72 76 80 82 85
## 2 5 2 6 2 2 3 6 2 2 2 2 3 2 5 2 6 6
## 87 92 96 97 106 113 118 128 130 136 156 159 166 167 170 171 174 179
## 5 2 1 2 1 2 2 3 6 6 5 4 1 6 6 4 1 1
## 185 191 197 199 200 204 208 209 214
## 6 6 1 3 2 6 5 6 6
##
## Within cluster sum of squares by cluster:
## [1] 97332955 16265503 3921650 237150514 22834895 6991191
## (between_SS / total_SS = 95.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
plot(wb,col=wb.6.clust$cluster) # Plot Clusters
plot(wb[c("Per.capita.income","Literacy")],col=wb.6.clust$cluster)
points(wb.6.clust$centers, col = 1:2, pch = 8)
# Centroid Plot against 1st two discriminant functions
clusplot(wb, wb.6.clust$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
# library(fpc)
# plotcluster(wb,wb.6.clust$cluster)
K-medoids clustering in R. PAM (Partitioning Around Medoids) can handle categorical data and is robust to outliers.
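The categorical-data claim rests on pairing pam() with a suitable dissimilarity: cluster::daisy() with Gower's coefficient handles mixed numeric and factor columns. A minimal sketch with a made-up mixed data frame:

```r
require(cluster)
# Gower dissimilarity copes with mixed column types; pam() then clusters
# the dissimilarity matrix directly (diss = TRUE).
mixed <- data.frame(income = c(10, 12, 50, 55),
                    region = factor(c("north", "north", "south", "south")))
d   <- daisy(mixed, metric = "gower")
fit <- pam(d, k = 2, diss = TRUE)
fit$clustering  # rows 1-2 and rows 3-4 end up in different clusters
```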
# PAM
k<-4
wb.pam.4.clust<- pam(wb,k, keep.diss = TRUE, keep.data = TRUE)
wb.pam.4.clust
## Medoids:
## ID Life.expectancy Infant.mortality Per.capita.income Literacy
## 13 3 70.76322 27.9 5496.345 99.78936
## 166 31 74.33722 12.5 20481.745 94.42635
## 16 4 71.62590 30.7 1211.702 59.72154
## 171 34 82.64634 2.1 52888.745 96.54015
## Clustering vector:
## 6 7 13 16 25 26 27 39 57 58 59 60 61 72 76 80 82 85
## 1 2 1 3 1 1 1 3 1 1 1 1 1 1 2 1 3 3
## 87 92 96 97 106 113 118 128 130 136 156 159 166 167 170 171 174 179
## 1 1 2 1 2 1 1 1 3 3 1 4 2 3 3 4 2 2
## 185 191 197 199 200 204 208 209 214
## 3 3 2 1 1 3 2 3 3
## Objective function:
## build swap
## 2521.326 2508.203
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
plot(wb.pam.4.clust, which.plots = 2)
# long bars are good - they indicate greater within-cluster similarity
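The same silhouette information can be read numerically: silhouette() returns a width per observation, and the average width summarizes the whole clustering (closer to 1 is better). A sketch using the built-in iris measurements as stand-in data:

```r
require(cluster)
# Average silhouette width for a 3-medoid PAM fit on the iris measurements.
fit <- pam(iris[, 1:4], k = 3)
sil <- silhouette(fit)        # one row per observation
mean(sil[, "sil_width"])      # overall cluster quality
```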
Gap statistic
clusGap() calculates a goodness of clustering measure, the “gap” statistic. For each number of clusters k, it compares \(\log(W(k))\) with \(E^*[\log(W(k))]\) where the latter is defined via bootstrapping, i.e. simulating from a reference distribution.
maxSE(f, SE.f) determines the location of the maximum of f, taking a “1-SE rule” into account for the SE methods. The default method “firstSEmax” looks for the smallest k such that its value f(k) is not more than 1 standard error away from the first local maximum. This is similar but not the same as “Tibs2001SEmax”, Tibshirani et al’s recommendation of determining the number of clusters from the gap statistics and their standard deviations.
See ?clusGap for details.
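maxSE() can be tried on its own with made-up gap values g and standard errors se, applying the rule quoted above: the first local maximum here is at k = 3 (gap 0.56), and the smallest k within one standard error of it is k = 2 (0.55 >= 0.56 - 0.05):

```r
require(cluster)
# Apply the "firstSEmax" 1-SE rule to hand-made gap values.
g  <- c(0.40, 0.55, 0.56, 0.48)
se <- c(0.05, 0.05, 0.05, 0.05)
maxSE(g, se, method = "firstSEmax")  # 2
```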
gap<-clusGap(wb,FUNcluster=pam,K.max=10) # Bootstrapping
gap$Tab
## logW E.logW gap SE.sim
## [1,] 11.826708 12.521009 0.6943005 0.07153167
## [2,] 11.261341 11.788692 0.5273507 0.06028033
## [3,] 10.782713 11.336035 0.5533223 0.07707577
## [4,] 10.516580 10.998472 0.4818913 0.07692385
## [5,] 10.171902 10.727707 0.5558048 0.09042793
## [6,] 9.938525 10.502693 0.5641680 0.09130502
## [7,] 9.601981 10.297558 0.6955772 0.09311483
## [8,] 9.332552 10.114634 0.7820818 0.09625402
## [9,] 9.143441 9.942525 0.7990832 0.10137723
## [10,] 8.938730 9.793015 0.8542853 0.10666050
gdf<-as.data.frame(gap$Tab)
head(gdf)
## logW E.logW gap SE.sim
## 1 11.826708 12.52101 0.6943005 0.07153167
## 2 11.261341 11.78869 0.5273507 0.06028033
## 3 10.782713 11.33604 0.5533223 0.07707577
## 4 10.516580 10.99847 0.4818913 0.07692385
## 5 10.171902 10.72771 0.5558048 0.09042793
## 6 9.938525 10.50269 0.5641680 0.09130502
qplot(x=1:nrow(gdf),y=logW,data = gdf,geom="line",color="red")+geom_point(aes(y=logW),color="orange")+geom_line(aes(y=E.logW),color="blue")+geom_point(aes(y=E.logW),color="purple")
# Gap statistic
qplot(x=1:nrow(gdf),y=gap,data = gdf,geom="line",color="red")+geom_point(aes(y=gap),color="orange")+geom_errorbar(aes(ymin=gap-SE.sim,ymax=gap+SE.sim),color="brown")
Hierarchical clustering in R
Hierarchical Clustering for the wholesale customers data set.
spend.h.clust<- hclust(d=dist(spend))
plot(spend.h.clust)
spend.h.clust.si<- hclust(dist(spend), method = "single")
spend.h.clust.co<- hclust(dist(spend), method = "complete")
spend.h.clust.av<- hclust(dist(spend), method = "average")
spend.h.clust.ce<- hclust(dist(spend), method = "centroid")
plot(spend.h.clust.si, labels = FALSE)
plot(spend.h.clust.co, labels = FALSE)
plot(spend.h.clust.av, labels = FALSE)
plot(spend.h.clust.ce, labels = FALSE)
Plotting to determine the cluster level.
plot(spend.h.clust, labels = FALSE)
rect.hclust(spend.h.clust, k=3, border="red")
rect.hclust(spend.h.clust, k=5, border="green")
rect.hclust(spend.h.clust, k=9, border="blue")
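rect.hclust() only draws boxes; cutree() returns the actual cluster memberships at a chosen k. A sketch using the built-in USArrests data as a stand-in for spend:

```r
# cutree() converts an hclust tree into a flat vector of cluster labels.
h  <- hclust(dist(USArrests))
m3 <- cutree(h, k = 3)  # one label per observation
table(m3)               # cluster sizes
```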
Hierarchical clustering with amap's hcluster, followed by centroid clustering with squared Euclidean distance.
h_c <- hcluster(spend,link = "ave") # require(amap)
plot(h_c)
plot(h_c, hang = -1)
### centroid clustering and squared Euclidean distance
h_c<- hclust(dist(spend)^2, "cen")
### Cutting the tree into 20 clusters and reconstructing the upper part of the tree from the cluster centers
memb <- cutree(h_c, k = 20)
cent <- NULL
for(k in 1:20){
cent <- rbind(cent, colMeans(spend[,-1][memb == k, , drop = FALSE]))
}
h_c1 <- hclust(dist(cent)^2, method = "cen", members = table(memb))
opar <- par(mfrow = c(1, 2))
plot(h_c, labels = FALSE, hang = -1, main = "Original Tree")
plot(h_c1, labels = FALSE, hang = -1, main = "Re-start from 20 clusters")
par(opar)
Other combinations
## other combinations
h_c <- hcluster(spend,method = "euc",link = "ward", nbproc= 1,
doubleprecision = TRUE)
plot(h_c, hang = -1)
h_c <- hcluster(spend,method = "max",link = "single", nbproc= 2,
doubleprecision = TRUE)
plot(h_c, hang = -1)
h_c <- hcluster(spend,method = "man",link = "complete", nbproc= 1,
doubleprecision = TRUE)
plot(h_c, hang = -1)
h_c <- hcluster(spend,method = "can",link = "average", nbproc= 2,
doubleprecision = TRUE)
plot(h_c, hang = -1)
h_c <- hcluster(spend,method = "bin",link = "mcquitty", nbproc= 1,
doubleprecision = FALSE)
plot(h_c, hang = -1)
h_c <- hcluster(spend,method = "pea",link = "median", nbproc= 2,
doubleprecision = FALSE)
plot(h_c, hang = -1)
h_c <- hcluster(spend,method = "cor",link = "centroid", nbproc= 1,
doubleprecision = FALSE)
plot(h_c, hang = -1)
h_c <- hcluster(spend,method = "spe",link = "complete", nbproc= 2,
doubleprecision = FALSE)
plot(h_c, hang = -1)
An Introduction to Statistical Learning with Applications in R (2013) by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Free online via SpringerLink: http://link.springer.com/book/10.1007/978-1-4614-7138-7
UC Irvine Machine Learning Repository
K-means clustering is not a free lunch